Item response theory (IRT) has been adopted as the theoretical foundation of computerized adaptive testing (CAT) for several decades. In applying IRT to CAT, however, there are certain considerations that are essential and yet tend to be neglected. In this paper, these essential issues are first addressed and discussed. Several ways of eliminating noise and bias in estimating the individual parameter $\theta_a$ of person $a$ are then proposed and discussed, so that the accuracy and efficiency of ability estimation are increased.

I. CONTENT VALIDITY OF THE ABILITY DIMENSION 

[I.1] Necessity of Operational Definition of Ability $\theta$

There has been a tendency that, once methodologies have been developed in IRT and accommodated in computer software, researchers apply them rather mechanically, without questioning whether their target of estimation, ability $\theta$, is properly defined in the process. Without due consideration of this issue, however, all our effort will be meaningless, and we will end up with mere artifacts of little psychological and educational significance. To give a concrete example, there is no guarantee that the $\theta$'s estimated by LOGIST and BILOG represent the same ability even when they are based on the same set of data, and yet very few researchers raise this question.

Thus an operational definition of $\theta$ is by far the most important consideration in applying IRT to educational and psychological data. Although ability $\theta$ tends to be simply assumed, and its unidimensionality taken for granted, we must start by defining $\theta$ operationally and confirming its unidimensionality.

[I.2] Mathematical Challenge and Contribution to Education

In developing theories based on mathematics, there is usually a great deal of mathematical challenge that motivates psychometricians to work on specific topics. Thus we owe valuable outcomes in IRT to those theorists who have accepted such challenges, conquered difficult problems, and provided us with methodologies.

Too much emphasis on mathematical challenge sometimes makes us lose perspective, however. Take, for example, the simultaneous estimation of the individual parameter $\theta_a$ of a person $a$ and the item parameters of some mathematical model. There is no doubt that this topic involves a great deal of mathematical challenge, and yet we must ask whether it is legitimate to estimate individual and item parameters simultaneously.

It is advisable to keep in mind that our objective is to estimate the individual parameter $\theta_a$, and that test items are only tools with which $\theta_a$ is estimated. Thus, whenever the need arises, we can change or replace these human-made test items. In defining ability $\theta$ operationally, a set of items that reflects the target ability must be carefully selected so that the content validity of the resulting ability dimension is assured.

[I.3] Core Test Items

Suppose we have a set of test items whose content validity is assured by our past research findings. Let us call them core test items. If we succeed in extracting a single principal common factor behind these items, then we may accept it as the operationally defined $\theta$. If we do not, then the structure of the common factors should be examined, and appropriate deletion and/or addition of items should eliminate minor clusters so that a single principal common factor remains.

Ability $\theta$ thus operationally defined should have content validity, and it will be used for the calibration of all items in the itempool. This is especially useful in on-line item calibration. Note that the core items do not have to be included in the itempool. For example, suppose that, for practicality, we need to use only dichotomous response items in CAT. We can still include graded response items in the set of core test items, and in fact it is desirable to do so because:
1. in general, the amount of item information provided by a graded response item is greater than that of a dichotomous response item (see Samejima, 1969), and

2. more logical reasoning processes can be accommodated in a graded response item than in a dichotomous response item.
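Point 1 can be checked numerically for a particular item. The sketch below is our own illustration, not the author's procedure: it assumes the normal ogive graded response model (Eqs. (12) and (13) of Section II) and borrows $a_g = 1.0$ and thresholds $-1.50, -0.50, 0.00, 0.75, 1.25$ as illustrative values; the dichotomous item is obtained by keeping only the top threshold. Because that binary score is a coarsening of the graded score, its information can never exceed the graded item's. The function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cdf(z):
    """Standard normal distribution function Phi(z), via erfc for tail accuracy."""
    return 0.5 * math.erfc(-z / SQRT2)


def graded_info(theta, a, thresholds):
    """Item information sum_x (P_x')^2 / P_x for a normal ogive graded item.

    thresholds holds b_1 < ... < b_m; the boundary curves P*_0 = 1 and
    P*_{m+1} = 0 are added here.
    """
    cum = [1.0] + [cdf(a * (theta - b)) for b in thresholds] + [0.0]
    dcum = [0.0] + [a * pdf(a * (theta - b)) for b in thresholds] + [0.0]
    info = 0.0
    for x in range(len(thresholds) + 1):
        p = cum[x] - cum[x + 1]       # operating characteristic of grade x
        dp = dcum[x] - dcum[x + 1]    # its derivative with respect to theta
        if p > 0.0:                   # skip categories that underflow to zero
            info += dp * dp / p
    return info


A_G = 1.0                                      # illustrative values
THRESHOLDS = [-1.50, -0.50, 0.00, 0.75, 1.25]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    graded = graded_info(theta, A_G, THRESHOLDS)
    dichotomous = graded_info(theta, A_G, [THRESHOLDS[-1]])  # top threshold only
    print(f"theta={theta:+.1f}  graded={graded:.3f}  dichotomous={dichotomous:.3f}")
```

A single function covers both cases because a dichotomous item is simply a graded item with one finite threshold.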



Insert Figure 1 About Here 



Figure 1 presents a set of example questions taken from LSAT, the Official Prep Test III, 1991, Vol. 2. We could make a single graded response item out of these questions, with the grades 0, 1, 2, 3, 4, as illustrated in Figure 2. In this example, score 1 is given to those who found the positions of J and T directly from statements (e) and (f); score 2 to those who, in addition, discovered the indeterminacy of the positions of K and L, and of X, Y and Z, from statements (b), (c) and (d); score 3 to those who found the position of U; and score 4 to those who discovered the positions of K and L in the "if" situation given by statement (g). This type of graded response item is appropriate as a core test item because of its abundant item information over a wide range of $\theta$, and because it represents the logical reasoning processes necessary for grasping both what we can say and what we cannot say on the basis of the statements. Note that we could increase the number of grade categories to 6, 7 or more by further elaborating "if" questions like (g).



Insert Figure 2 About Here 



II. ELIMINATION OF NOISE CAUSED BY GUESSING 



[II.1] Unique Maximum Condition

Let $g$ $(= 1, 2, \ldots, n)$ denote an item that elicits any discrete response. Let $P_{k_g}(\theta)$ be the operating characteristic of the discrete response $K_g = k_g$, defined by

$$P_{k_g}(\theta) = \text{prob.}\,[K_g = k_g \mid \theta], \qquad (1)$$

with the assumption that $P_{k_g}(\theta)$ is at least five times differentiable with respect to $\theta$. Samejima (1969, 1972) defined the basic function $A_{k_g}(\theta)$ such that

$$A_{k_g}(\theta) = \frac{\partial}{\partial\theta}\log P_{k_g}(\theta). \qquad (2)$$

Samejima (1973) also defined the item response information function, $I_{k_g}(\theta)$, which is given by

$$I_{k_g}(\theta) = -\frac{\partial^2}{\partial\theta^2}\log P_{k_g}(\theta) = -\frac{\partial}{\partial\theta}A_{k_g}(\theta), \qquad (3)$$

and the item information function $I_g(\theta)$ is obtained as the conditional expectation, given $\theta$, of the item response information function, that is,

$$I_g(\theta) = E[I_{k_g}(\theta) \mid \theta] = \sum_{k_g} I_{k_g}(\theta)\,P_{k_g}(\theta). \qquad (4)$$

Eq. (4) includes Birnbaum's (1968) item information function for a dichotomous item as a special case.

The response pattern $V$ is given by

$$V = (K_1, K_2, \ldots, K_g, \ldots, K_n)', \qquad (5)$$

and, due to local independence (Lord & Novick, 1968), the likelihood function $L(v \mid \theta)$ for general discrete responses can be written as

$$L(v \mid \theta) = P_v(\theta) = \text{prob.}\,[V = v \mid \theta] = \prod_{k_g \in v} P_{k_g}(\theta), \qquad (6)$$

where $P_v(\theta)$ is the operating characteristic of the response pattern $V = v$. From Eqs. (3) and (6), the response pattern information function, $I_v(\theta)$, is given by

$$I_v(\theta) = -\frac{\partial^2}{\partial\theta^2}\log P_v(\theta) = \sum_{k_g \in v} I_{k_g}(\theta). \qquad (7)$$

The test information function, $I(\theta)$, is defined as the conditional expectation of the response pattern information function, given $\theta$, and from Eqs. (3), (4), (6) and (7) we obtain

$$I(\theta) = E[I_v(\theta) \mid \theta] = \sum_{v} I_v(\theta)\,P_v(\theta) = \sum_{g=1}^{n} I_g(\theta). \qquad (8)$$

It is obvious from Eqs. (2) and (6) that for a test of $n$ graded response items there are only $\sum_{g=1}^{n} m_g + n$ basic functions defined by Eq. (2). Using this small number of basic functions, a simple algorithm provides $\prod_{g=1}^{n}(m_g + 1)$ likelihood equations, and hence the same number of maximum likelihood estimates (MLE's) $\hat{\theta}_v$. For example, if $n = 10$ and $m_g = 2$ for all items, then 30 basic functions provide MLE's $\hat{\theta}_v$ for as many as 59,049 different response patterns. When all items are scored dichotomously, the number of basic functions is $2n$, and they provide $2^n$ MLE's.
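The counting argument can be sketched in a few lines. This is our illustration only; the function names are hypothetical, and `likelihood_equation` merely shows how every response pattern reuses the same small table of basic functions by summing the entries it selects (the left-hand side of the likelihood equation obtained from Eqs. (2) and (6)).

```python
from math import prod


def count_basic_functions(m):
    """m[g] is the top grade of item g, so item g has m[g] + 1 categories,
    each with its own basic function A_{k_g}(theta) of Eq. (2)."""
    return sum(mg + 1 for mg in m)


def count_response_patterns(m):
    """Each pattern picks one category per item and has its own likelihood equation."""
    return prod(mg + 1 for mg in m)


def likelihood_equation(pattern, basic):
    """Return theta -> sum of A_{k_g}(theta) over the categories in `pattern`.

    basic[g][k] is a callable for item g, category k; every response pattern
    reuses the same small table of basic functions.
    """
    return lambda theta: sum(basic[g][k](theta) for g, k in enumerate(pattern))


print(count_basic_functions([2] * 10), count_response_patterns([2] * 10))  # 30 59049
print(count_basic_functions([1] * 10), count_response_patterns([1] * 10))  # 20 1024
```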

Samejima (1969, 1972) proposed a sufficient condition for a discrete item response to provide a unique local or terminal maximum likelihood estimate for every response pattern consisting of such item responses. The condition is that the basic function $A_{k_g}(\theta)$, defined by Eq. (2), be strictly decreasing in $\theta$, with non-negative and non-positive values for its two asymptotes (as $\theta \to -\infty$ and $\theta \to \infty$), respectively. For brevity, this condition has often been referred to as the unique maximum condition. It is noted from Eqs. (2) and (3) that the first part of this condition can be rephrased: the item response information function $I_{k_g}(\theta)$ must be positive for all $\theta$ except, at most, at an enumerable number of points where it may be zero.

It has been shown (Samejima, 1969, 1972) that the unique maximum condition is satisfied by both the normal ogive model and the logistic model for dichotomous responses. Let $P_g(\theta)$ be the item characteristic curve (ICC), which is defined by

$$P_g(\theta) = \text{prob.}\,[U_g = 1 \mid \theta],$$

where $U_g$ $(= 0, 1)$ is the binary item score of item $g$, with $u_g$ as its realization. In the normal ogive and logistic models, the ICC's are given by

$$P_g(\theta) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{a_g(\theta - b_g)} \exp\!\left[-\frac{u^2}{2}\right] du \qquad (9)$$

and

$$P_g(\theta) = \frac{1}{1 + \exp\{-D a_g(\theta - b_g)\}}, \qquad (10)$$

respectively, where $a_g\ (> 0)$ and $b_g$ are the discrimination and difficulty parameters, and $D = 1.702$ in Eq. (10) is a scaling factor.
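The role of $D$ can be checked numerically. A commonly cited property, which this sketch of ours lets the reader verify, is that with $D \approx 1.7$ the ICC's of Eqs. (9) and (10), with the same $a_g$ and $b_g$, never differ by more than about 0.01 anywhere on the $\theta$ scale. The grid and function names are our own choices.

```python
import math

D = 1.702  # scaling factor of Eq. (10)


def normal_ogive_icc(theta, a, b):
    """Eq. (9): standard normal distribution function at a(theta - b)."""
    return 0.5 * math.erfc(-a * (theta - b) / math.sqrt(2.0))


def logistic_icc(theta, a, b):
    """Eq. (10) with D = 1.702."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def max_gap(a=1.0, b=0.0, lo=-6.0, hi=6.0, steps=12001):
    """Largest |normal ogive - logistic| over an evenly spaced theta grid."""
    grid = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return max(abs(normal_ogive_icc(t, a, b) - logistic_icc(t, a, b)) for t in grid)


print(round(max_gap(), 4))
```

Because both curves depend on $\theta$ only through $a_g(\theta - b_g)$, the maximum gap does not depend on the item parameters.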

It has also been shown (Samejima, 1972) that both the normal ogive and logistic models for graded responses satisfy the unique maximum condition, and so does Bock's nominal response model (Bock, 1972), which includes both Masters' partial credit model (Masters, 1982) and Muraki's generalized partial credit model (Muraki, 1992) for graded responses as special cases (see Samejima, 1972). It has also been proved that all models belonging to the logistic positive exponent family (Samejima, 1997) satisfy the same condition. Thus, in these models, the likelihood function based on the response pattern has a unique local or terminal maximum for every $v \in V$.

It should be noted, however, that the three-parameter logistic model (3PL), whose ICC is given by

$$P_g(\theta) = c_g + (1 - c_g)\,\Psi_g(\theta), \qquad (11)$$

where

$$\Psi_g(\theta) = \frac{1}{1 + \exp\{-D a_g(\theta - b_g)\}}$$

and $c_g$ is the third parameter, called the guessing parameter, does not satisfy the unique maximum condition (Samejima, 1972, 1973); thus, for some response patterns, the likelihood function may have multiple modes. Yen, Burket & Sykes (1991) have shown that, when the 3PL is used, multimodality of the likelihood function occurs not infrequently for response patterns commonly encountered in empirical data.
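The failure of the 3PL can be seen directly in the basic function of a correct answer. The following sketch is our own illustration with hypothetical parameter values; we do not claim it reproduces how Samejima (1973) derives or defines the critical value $\theta_g$ used in [II.2]. Under Eq. (11), $A(\theta)$ for $u_g = 1$ rises from 0 at the low end before falling, so it is not strictly decreasing. Setting its derivative to zero gives $\Psi_g = \sqrt{c_g}/(1 + \sqrt{c_g})$, that is, $\theta = b_g + \ln(c_g)/(2Da_g)$. With $c_g = 0$ the function reduces to $Da_g[1 - \Psi_g(\theta)]$, which is strictly decreasing as the condition requires.

```python
import math

D = 1.702


def psi(theta, a, b):
    """Logistic curve Psi_g(theta) used in Eq. (11)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def basic_correct(theta, a, b, c):
    """Basic function A(theta) = d/dtheta log P_g(theta) for a correct answer
    under the 3PL of Eq. (11); c = 0 gives the logistic model of Eq. (10)."""
    s = psi(theta, a, b)
    p = c + (1.0 - c) * s
    dp = (1.0 - c) * D * a * s * (1.0 - s)
    return dp / p


def critical_value(a, b, c):
    """Stationary point of basic_correct for c > 0, from setting its derivative
    to zero: Psi = sqrt(c) / (1 + sqrt(c)), i.e. theta = b + ln(c) / (2 D a)."""
    return b + math.log(c) / (2.0 * D * a)


for c in (0.0, 0.2):
    print(c, [round(basic_correct(t, 1.0, 0.0, c), 3) for t in (-3.0, -1.0, 0.0, 1.0, 3.0)])
print(round(critical_value(1.0, 0.0, 0.2), 3))
```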
[II.2] Suggestion of the Use of the Normal Ogive Model

It is common practice in CAT to adopt the 3PL as the mathematical model for multiple-choice test items in the itempool. The third parameter $c_g$ in Eq. (11) is nothing but noise that lowers the accuracy of estimation of the individual parameter $\theta_a$, as is obvious from the fact that the 3PL does not even satisfy the unique maximum condition. It is therefore desirable to replace the 3PL with some other model that includes less noise and provides greater accuracy in ability estimation.

To realize this, we must first develop test items whose ICC's do not include so much noise within the framework of the multiple-choice format. Samejima (1994a) distinguished informative distractors from equivalent distractors, and called the operating characteristic of an informative distractor the plausibility function. Suppose we have developed an item whose distractors have differential information, in the sense that they tend to attract examinees of different levels of ability. In practice, it is desirable to include a distractor whose plausibility is recognized by examinees of substantially high levels of ability, another distractor that attracts examinees of slightly lower levels of ability, and so on, down to a distractor that attracts examinees of very low levels of ability. In the noiseless situation we can treat such an item as a graded response item.



Insert Figure 3 About Here 



Figure 3 illustrates the ICC of such an item by a solid line, and the plausibility functions of the 4 distractors by dashed lines of various lengths, in the noiseless situation, following the normal ogive model for graded responses (Samejima, 1969, 1972). In this model, the operating characteristics, $P_{x_g}(\theta)$'s, are given by

$$P_{x_g}(\theta) = P^*_{x_g}(\theta) - P^*_{x_g+1}(\theta), \qquad (12)$$

where $P^*_{x_g}(\theta)$, called the cumulative operating characteristic of the graded item score $x_g$ $(= 0, 1, 2, \ldots, m_g)$ of item $g$, is given by

$$P^*_{x_g}(\theta) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{a_g(\theta - b_{x_g})} \exp\!\left[-\frac{u^2}{2}\right] du, \qquad (13)$$

where $a_g\ (> 0)$ is the item discrimination parameter and $b_{x_g}$ is the item response difficulty parameter, which satisfies

$$-\infty = b_0 < b_1 < \cdots < b_{m_g} < b_{m_g+1} = \infty.$$

In the present example, the parameters in Eq. (13) are $a_g = 1.0$ and $b_{x_g} = -1.50, -0.50, 0.00, 0.75, 1.25$ for $x_g = 1, \ldots, 5$, respectively.
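Eqs. (12) and (13) are straightforward to evaluate. The sketch below is ours, with hypothetical function names. It computes the six operating characteristics of the example item ($m_g = 5$). With the thresholds in increasing order, $x_g = 0$ is the strictly decreasing curve and $x_g = 5$ is the correct answer, whose curve equals $P^*_5(\theta)$.

```python
import math

A_G = 1.0
B_X = [-1.50, -0.50, 0.00, 0.75, 1.25]   # b_1 < ... < b_5 from the example


def cumulative(theta, a, b):
    """Eq. (13): normal ogive cumulative operating characteristic P*_x(theta)."""
    return 0.5 * math.erfc(-a * (theta - b) / math.sqrt(2.0))


def category_probs(theta, a=A_G, bs=B_X):
    """Eq. (12): P_x = P*_x - P*_{x+1} for x = 0, ..., m."""
    # pad with P*_0 = 1 (b_0 = -inf) and P*_{m+1} = 0 (b_{m+1} = +inf)
    cum = [1.0] + [cumulative(theta, a, b) for b in bs] + [0.0]
    return [cum[x] - cum[x + 1] for x in range(len(bs) + 1)]


for theta in (-2.0, 0.0, 2.0):
    print(theta, [round(p, 3) for p in category_probs(theta)])
```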

In the noiseless situation no guessing occurs, and examinees who do not find plausibility 
in any of the 5 alternative answers are supposed to honestly check the additional category, 
don’t know. The strictly decreasing curve with the longest dashes in Figure 3 represents the 
operating characteristic of this don’t-know category. 

In practice, however, we cannot expect such total honesty, and it is likely that examinees 
in the don’t-know category turn to guessing. Figure 4 presents the operating characteristics of 
the five alternative answers assuming that examinees in the don’t-know group guess randomly. 
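Under the random-guessing assumption just stated, each of the five alternatives receives one fifth of the don't-know probability. The following is our minimal sketch of that redistribution, with hypothetical names, reusing Eqs. (12) and (13) with the example parameters. The resulting ICC has the lower asymptote $1/5 = 0.2$.

```python
import math

A_G = 1.0
B_X = [-1.50, -0.50, 0.00, 0.75, 1.25]
N_ALTERNATIVES = 5


def cumulative(theta, a, b):
    """Eq. (13)."""
    return 0.5 * math.erfc(-a * (theta - b) / math.sqrt(2.0))


def noiseless_probs(theta):
    """Eq. (12): categories 0 (don't know), 1-4 (distractors), 5 (correct)."""
    cum = [1.0] + [cumulative(theta, A_G, b) for b in B_X] + [0.0]
    return [cum[x] - cum[x + 1] for x in range(len(B_X) + 1)]


def with_random_guessing(theta):
    """Operating characteristics of the 5 alternatives when the don't-know group
    picks one alternative uniformly at random; the last entry is the correct answer."""
    p = noiseless_probs(theta)
    share = p[0] / N_ALTERNATIVES
    return [px + share for px in p[1:]]


for theta in (-3.0, 0.0, 1.25, 3.0):
    print(theta, round(with_random_guessing(theta)[-1], 3))
```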



Insert Figure 4 About Here 



Figure 5 presents the ICC taken from Figure 4 in comparison with the ICC of the normal ogive model for dichotomous responses, given by Eq. (9) with $a_g = 1.00$ and $b_g = 1.25$, together with the ICC of the 3PL with $c_g = 0.2$, which fits the ICC in question very well except at lower levels of $\theta$. If we accept the 3PL ICC for this item, the critical value $\theta_g$ equals 1.096, and below this value the uniqueness of the MLE is not assured (see Samejima, 1973). The item is, therefore, not appropriate for examinees whose individual parameters $\theta_a$ are below this value of $\theta$.
Insert Figure 5 About Here



If we accept the ICC of the normal ogive model as an approximation, however, the fit is extremely good on the interval $(0.3, \infty)$ of $\theta$. In CAT, this item is unlikely to be used for examinees whose $\theta_a$'s are lower than 0.3, where the ICC is as low as 0.15. In practice, therefore, this item can be treated as a noise-free item, and the use of the normal ogive model is justified.

Birnbaum (1968) proposed the logistic model, whose ICC is given by Eq. (10), as a substitute for the normal ogive model. A strength of the logistic model lies in its mathematical simplicity: it has the sufficient statistic $\sum_{g=1}^{n} a_g u_g$, which enables us to obtain the MLE without even using a computer. With today's computer technology, however, we can use the normal ogive model just as easily, so there is no need for a substitute model.
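The computational point can be illustrated with a small sketch of ours; names and parameter values are hypothetical. Both models are handled by the same bisection on the likelihood equation, which is valid because both satisfy the unique maximum condition. For the logistic model the likelihood equation reduces to $\sum_g a_g P_g(\theta) = \sum_g a_g u_g$, so two patterns with the same weighted sum $\sum_g a_g u_g$ receive the same MLE. The normal ogive model has no such simple statistic, but its MLE is no harder to compute.

```python
import math

D = 1.702


def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def score_logistic(theta, u, a, b):
    """d/dtheta log L for Eq. (10): sum_g D a_g [u_g - P_g(theta)]."""
    return sum(D * ag * (ug - 1.0 / (1.0 + math.exp(-D * ag * (theta - bg))))
               for ug, ag, bg in zip(u, a, b))


def score_normal_ogive(theta, u, a, b):
    """d/dtheta log L for Eq. (9): sum_g a_g phi(z) [u_g - P_g] / (P_g Q_g)."""
    total = 0.0
    for ug, ag, bg in zip(u, a, b):
        z = ag * (theta - bg)
        p, q = cdf(z), cdf(-z)
        total += ag * phi(z) * (ug - p) / (p * q)
    return total


def mle(score, u, a, b, lo=-6.0, hi=6.0, tol=1e-10):
    """Bisection on the likelihood equation. Under the unique maximum condition
    the score is strictly decreasing, so a sign change on [lo, hi] brackets the
    unique MLE. Assumes a mixed pattern (neither all-correct nor all-incorrect)."""
    f_lo = score(lo, u, a, b)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = score(mid, u, a, b)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


a, b = [1.0, 1.0, 2.0, 2.0], [-1.0, 0.5, 0.0, 1.0]
print(mle(score_logistic, [1, 1, 0, 0], a, b), mle(score_logistic, [0, 0, 1, 0], a, b))
print(mle(score_normal_ogive, [1, 1, 0, 0], a, b), mle(score_normal_ogive, [0, 0, 1, 0], a, b))
```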

[II.3] Effective Use of Nonparametric Approach 

In order to find out whether the distractors of our item are informative, we must discover their plausibility functions. For this purpose, nonparametric approaches to the estimation of operating characteristics, which do not assume any mathematical form a priori, are by far the most useful.



Insert Figure 6 About Here 



Figure 6 exemplifies the results obtained by using the simple sum procedure of the conditional p.d.f. approach (see Samejima, 1998) for the multiple-choice items of the Iowa Level 11 Vocabulary Subtest. These results disclose that the item represented by the upper graph has informative distractors, while the distractors of the item represented by the lower graph do not provide differential information and are, therefore, equivalent distractors. Because test items are human-made, if in pilot studies we discover that their distractors belong to the second category, we can replace them with more informative ones that belong to the first category. Such pilot studies can be conducted in on-line item calibration, which is appropriate for CAT.
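The simple sum procedure of the conditional p.d.f. approach is not reproduced here. As a plainly different, generic stand-in, the sketch below uses Gaussian kernel (Nadaraya-Watson) smoothing of option choices against provisional ability estimates. This traces each alternative's operating characteristic without assuming a functional form. All names, the bandwidth, and the toy data are hypothetical. Distractors whose curves peak at different levels of $\theta$ would be candidates for informative distractors; flat, overlapping curves would suggest equivalent ones.

```python
import math


def kernel_option_curves(theta_hat, choice, grid, n_options, bandwidth=0.3):
    """Gaussian-kernel (Nadaraya-Watson) estimate of P(option j | theta).

    theta_hat : provisional ability estimates of the examinees
    choice    : option index (0 .. n_options - 1) chosen by each examinee
    grid      : theta values at which the curves are evaluated
    Returns one row of n_options smoothed proportions per grid point.
    """
    curves = []
    for t in grid:
        weights = [math.exp(-0.5 * ((t - th) / bandwidth) ** 2) for th in theta_hat]
        total = sum(weights)
        if total == 0.0:                      # no examinee near this grid point
            curves.append([float("nan")] * n_options)
            continue
        row = [0.0] * n_options
        for w, j in zip(weights, choice):
            row[j] += w
        curves.append([r / total for r in row])
    return curves


# toy data: option 0 chosen at low theta, option 1 at high theta
theta_hat = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
choice = [0, 0, 0, 1, 1, 1]
for t, row in zip((-2.0, 0.0, 2.0), kernel_option_curves(theta_hat, choice, (-2.0, 0.0, 2.0), 2)):
    print(t, [round(p, 3) for p in row])
```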

III. INITIAL TEST ITEMS IN CAT 

Figure 7 presents the square roots of the test information function, which is given by Eq. (8), of five hypothetical tests, each consisting of 30 equivalent dichotomous items following the normal ogive model. The common difficulty parameter is $b_g = 0$ for all items of the five tests, and the common discrimination parameters are $a_g = 0.4, 0.7, 1.0, 1.5, 2.0$ for the items of the respective tests. The square root of the test information function, $\sqrt{I(\theta)}$, is the reciprocal of the asymptotic standard error of estimation, specified as a function of $\theta$.
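For equivalent normal ogive items, Eq. (8) has a closed form, so the five curves of Figure 7 can be tabulated directly. This is our sketch; the function name is hypothetical, and it evaluates only a few values of $\theta$. Larger $a_g$ gives a higher peak at $\theta = b_g$ but falls off faster away from it.

```python
import math


def pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def info_equivalent_items(theta, a, b=0.0, n=30):
    """Eq. (8) for n equivalent normal ogive items:
    I(theta) = n a^2 phi(z)^2 / [P(z) Q(z)], with z = a (theta - b)."""
    z = a * (theta - b)
    return n * a * a * pdf(z) ** 2 / (cdf(z) * cdf(-z))


thetas = (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0)
for a in (0.4, 0.7, 1.0, 1.5, 2.0):
    # sqrt(I) at each theta; 1 / sqrt(I) is the asymptotic standard error
    print(a, [round(math.sqrt(info_equivalent_items(t, a)), 2) for t in thetas])
```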



Insert Figure 7 About Here 



Figure 7 implies that, although the minimal estimation error is smaller when the common item discrimination parameter is larger, the interval of $\theta$ for which the error is sufficiently small is narrower. Thus, contrary to the general belief, the use of items with high discrimination parameters may not be desirable, especially at the initial stage of CAT, where the examinee's estimated individual parameter fluctuates over a relatively wide range.

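This is also supported by the fact that, if items have higher discrimination parameters, the MLE bias function $B(\theta;\hat{\theta}_V)$ has a narrower interval of $\theta$ for which the MLE is practically unbiased. For a test of dichotomous items in general, this function is given by

$$B(\theta;\hat{\theta}_V) = -\frac{1}{2\,[I(\theta)]^2}\sum_{g=1}^{n}\frac{P'_g(\theta)\,P''_g(\theta)}{P_g(\theta)\,Q_g(\theta)}, \qquad (14)$$

where the primes denote the first and second derivatives with respect to $\theta$, and

$$Q_g(\theta) = 1 - P_g(\theta). \qquad (15)$$

This is illustrated in Figure 8 for the same five hypothetical tests of 30 equivalent dichotomous items.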
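Eq. (14) can be evaluated directly for the five hypothetical tests. The following sketch is ours, with hypothetical names. It uses the normal ogive derivatives $P'_g = a_g\phi(z)$ and $P''_g = -a_g^2 z\,\phi(z)$, with $z = a_g(\theta - b_g)$. For equivalent items the result simplifies to $(\theta - b_g)P_gQ_g/[2n\phi(z)^2]$, which is positive above $b_g$ and negative below it, and grows much faster for large $a_g$.

```python
import math


def pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def bias_function(theta, a, b=0.0, n=30):
    """Eq. (14) for n equivalent normal ogive items.

    For the normal ogive, P' = a phi(z) and P'' = -a^2 z phi(z), z = a(theta - b).
    """
    z = a * (theta - b)
    p, q = cdf(z), cdf(-z)
    d1 = a * pdf(z)
    d2 = -a * a * z * pdf(z)
    info = n * d1 * d1 / (p * q)        # Eq. (18) summed over n identical items
    s = n * d1 * d2 / (p * q)           # sum over items of P' P'' / (P Q)
    return -s / (2.0 * info * info)


for a in (0.4, 0.7, 1.0, 1.5, 2.0):
    print(a, [round(bias_function(t, a), 3) for t in (-2.0, -1.0, 0.0, 1.0, 2.0)])
```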



Insert Figure 8 About Here 



A better solution to this problem than the use of low-discrimination items at the initial stage of CAT may be the use of several graded response items, such as the one illustrated in [I.3]. In general, a graded response item provides a greater amount of information than a dichotomous item, and also a wider interval of $\theta$ for which the MLE is practically unbiased, so its use will be an ideal solution.

IV. ELIMINATION OF BIAS IN ABILITY ESTIMATION 
[IV.1] Warm's Weighted Likelihood Estimate

A class of Bayesian modal estimators, $\theta^*$, of ability $\theta$ can be defined as the value of $\theta$ that maximizes

$$L(v \mid \theta)\,f(\theta),$$

where $L(v \mid \theta)$ is the likelihood function of a specific response pattern $V = v$, and $f(\theta)$ is known as a prior. Thus $\theta^*$ is the solution of

$$\frac{\partial}{\partial\theta}\log L(v \mid \theta) + \frac{\partial}{\partial\theta}\log f(\theta) = 0.$$

When all the $n$ items are scored dichotomously, the response pattern $V$ takes the form

$$V = (U_1, U_2, U_3, \ldots, U_n)'.$$

By local independence, the likelihood function $L(v \mid \theta)$ can be written as

$$L(v \mid \theta) = \prod_{g=1}^{n} P_g(\theta)^{u_g}\,Q_g(\theta)^{1-u_g}, \qquad (16)$$

where $Q_g(\theta)$ is given by Eq. (15).
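The estimating equation above can be solved by the same bisection used for the MLE, with the prior's log-derivative added. This is our sketch under stated assumptions: normal ogive ICC's (Eq. (9)) in Eq. (16), hypothetical names, and a flat prior (giving the MLE) compared with a standard normal prior, for which $\partial \log f/\partial\theta = -\theta$. With the normal prior the estimate is pulled toward 0, and it remains finite even for the all-correct pattern.

```python
import math


def pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def dlog_likelihood(theta, u, a, b):
    """d/dtheta log L(v | theta) for Eq. (16) with normal ogive ICC's."""
    total = 0.0
    for ug, ag, bg in zip(u, a, b):
        z = ag * (theta - bg)
        p, q = cdf(z), cdf(-z)
        total += (ug - p) * ag * pdf(z) / (p * q)
    return total


def modal_estimate(u, a, b, dlog_prior, lo=-6.0, hi=6.0, tol=1e-10):
    """Bisection on d/dtheta log L + d/dtheta log f = 0; assumes a sign change on [lo, hi]."""
    def f(t):
        return dlog_likelihood(t, u, a, b) + dlog_prior(t)
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def flat(theta):          # constant prior: the estimate is the MLE
    return 0.0


def std_normal(theta):    # N(0, 1) prior: d/dtheta log f(theta) = -theta
    return -theta


u, a, b = [1, 1] + [0] * 8, [1.0] * 10, [0.0] * 10
print(modal_estimate(u, a, b, flat), modal_estimate(u, a, b, std_normal))
```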

Lord (1983) proposed the bias function of the MLE, which is denoted by $B(\theta;\hat{\theta}_V)$ in this paper, for the 3PL, in which the ICC is given by Eq. (11). This bias function is given by

$$B(\theta;\hat{\theta}_V) = D\,[I(\theta)]^{-2}\sum_{g=1}^{n} a_g\,I_g(\theta)\left[\Psi_g(\theta) - \frac{1}{2}\right]. \qquad (17)$$

Warm's (1989) weighted likelihood estimate (WLE) was proposed in an effort to minimize the bias of $\theta^*$ by setting an appropriate prior, which he denoted $w(\theta)$. This prior can be expressed by the equation

$$\frac{\partial}{\partial\theta}\log w(\theta) = -B(\theta;\hat{\theta}_V)\,I(\theta),$$

where $B(\theta;\hat{\theta}_V)$ is the MLE bias function in the 3PL given by Eq. (17), and $I(\theta)$ is the test information function, which can be written as

$$I(\theta) = \sum_{g=1}^{n}\frac{[P'_g(\theta)]^2}{P_g(\theta)\,Q_g(\theta)} \qquad (18)$$

for general dichotomous responses (Birnbaum, 1968). Thus the WLE, which is denoted by $\tilde{\theta}_v$ in this paper, is the solution of

$$\frac{\partial}{\partial\theta}\log L(v \mid \theta) + \frac{\partial}{\partial\theta}\log w(\theta) = \sum_{g=1}^{n}\frac{[u_g - P_g(\theta)]\,P'_g(\theta)}{P_g(\theta)\,Q_g(\theta)} - B(\theta;\hat{\theta}_V)\,I(\theta) = 0. \qquad (19)$$
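Eq. (19) can be solved numerically in the same way as the likelihood equation. The sketch below is ours and assumes normal ogive items. We therefore use the general dichotomous bias function, Eq. (14), in place of the 3PL form of Eq. (17). It reproduces the setting of Figure 9 (30 equivalent items, $a_g = 0.7$, $b_g = 0.0$); names are hypothetical. The WLE lies closer to $b_g$ than the MLE, coincides with it for 15 correct answers, and is finite for the all-correct pattern.

```python
import math


def pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def item_terms(theta, a, b):
    """P, Q, P', P'' for a normal ogive item (Eq. (9))."""
    z = a * (theta - b)
    return cdf(z), cdf(-z), a * pdf(z), -a * a * z * pdf(z)


def wle_equation(theta, u, a, b, weighted=True):
    """Left-hand side of Eq. (19); weighted=False drops the -B*I term (MLE)."""
    score = info = s = 0.0
    for ug, ag, bg in zip(u, a, b):
        p, q, d1, d2 = item_terms(theta, ag, bg)
        score += (ug - p) * d1 / (p * q)
        info += d1 * d1 / (p * q)             # Eq. (18)
        s += d1 * d2 / (p * q)
    if not weighted:
        return score
    bias = -s / (2.0 * info * info)           # Eq. (14)
    return score - bias * info


def solve(u, a, b, weighted=True, lo=-8.0, hi=8.0, tol=1e-10):
    """Bisection; the left-hand side is decreasing in theta for these items."""
    f_lo = wle_equation(lo, u, a, b, weighted)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = wle_equation(mid, u, a, b, weighted)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


a30, b30 = [0.7] * 30, [0.0] * 30
one_correct = [1] + [0] * 29
print(solve(one_correct, a30, b30, weighted=False), solve(one_correct, a30, b30))
print(solve([1] * 30, a30, b30))   # all-correct pattern: the WLE is finite
```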



[IV.2] Expansion of the WLE for General Discrete Responses

Samejima (1993a, 1993b) expanded Lord's MLE bias function in the 3PL to any discrete responses $K_g$, for which the response pattern $V$ is given by Eq. (5). This MLE bias function for general discrete responses is given by

$$B(\theta;\hat{\theta}_V) = -\frac{1}{2\,[I(\theta)]^2}\sum_{g=1}^{n}\sum_{k_g}\frac{P'_{k_g}(\theta)\,P''_{k_g}(\theta)}{P_{k_g}(\theta)}, \qquad (20)$$

where the primes denote the first and second derivatives with respect to $\theta$, and $P_{k_g}(\theta)$ and $I(\theta)$ are defined by Eqs. (1) and (8), respectively. When $K_g$ is replaced by the graded item score $X_g$, all $k_g$'s in Eq. (20) are changed to $x_g$'s $(= 0, 1, 2, \ldots, m_g)$. When it is replaced by the binary item score $U_g$, Eq. (20) becomes Eq. (14), which includes Eq. (17) as a special case.
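The reduction just described can be checked in a few lines. The following worked check is ours. It uses only $P_0 = Q_g = 1 - P_g$, so that $P'_0 = -P'_g$ and $P''_0 = -P''_g$. For the 3PL of Eq. (11), it also uses $P'_g = Da_g(1 - c_g)\Psi_g(1 - \Psi_g)$ and hence $P''_g = Da_g P'_g(1 - 2\Psi_g)$, together with $I_g = (P'_g)^2/(P_g Q_g)$ from Eq. (18).

$$
\sum_{u_g=0}^{1}\frac{P'_{u_g}P''_{u_g}}{P_{u_g}}
= \frac{P'_g P''_g}{P_g} + \frac{(-P'_g)(-P''_g)}{Q_g}
= \frac{P'_g P''_g}{P_g Q_g},
$$

which turns Eq. (20) into Eq. (14). For the 3PL,

$$
\frac{P'_g P''_g}{P_g Q_g} = D a_g \frac{(P'_g)^2}{P_g Q_g}\,(1 - 2\Psi_g) = D a_g I_g\,(1 - 2\Psi_g),
$$

so that Eq. (14) becomes

$$
-\frac{1}{2[I(\theta)]^2}\sum_{g=1}^{n} D a_g I_g(\theta)\,[1 - 2\Psi_g(\theta)]
= D\,[I(\theta)]^{-2}\sum_{g=1}^{n} a_g I_g(\theta)\left[\Psi_g(\theta) - \tfrac{1}{2}\right],
$$

which is Eq. (17).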
A straightforward expansion of Warm's WLE for the 3PL to general discrete responses provides the solution of

$$\frac{\partial}{\partial\theta}\log L(v \mid \theta) + \frac{\partial}{\partial\theta}\log w(\theta) = \sum_{k_g \in v} A_{k_g}(\theta) - B(\theta;\hat{\theta}_V)\,I(\theta) = 0 \qquad (21)$$

as the WLE $\tilde{\theta}_v$, where $L(v \mid \theta)$ is given by Eq. (6), $A_{k_g}(\theta)$ is the basic function defined by Eq. (2), $B(\theta;\hat{\theta}_V)$ is the MLE bias function given by Eq. (20), and $I(\theta)$ is the test information function for general discrete responses provided by Eq. (8).



[IV.3] Graphical Comparison of the WLE with the MLE

A graphical representation of the MLE and Warm's WLE makes their comparison easy. Figure 9 concerns a hypothetical test of 30 equivalent dichotomous items following the normal ogive model, whose ICC is given by Eq. (9), with the common discrimination parameter $a_g = 0.7$ and the common difficulty parameter $b_g = 0.0$. It presents the MLE bias function $B(\theta;\hat{\theta}_V)$ of this test by a short dashed line, and its product with the test information function $I(\theta)$ by a long dashed line. Also presented in the same figure is $\partial \log L(v \mid \theta)/\partial\theta$ for four response patterns, which include 0, 1, 7, and 15 correct answers, respectively.



Insert Figure 9 About Here 



It is obvious from Eq. (21) that the WLE $\tilde{\theta}_v$ of each response pattern is the value of $\theta$ at which $\partial \log L(v \mid \theta)/\partial\theta$ crosses the long dashed curve representing $B(\theta;\hat{\theta}_V)\,I(\theta)$, while the MLE $\hat{\theta}_v$ of the same response pattern is the value of $\theta$ at which $\partial \log L(v \mid \theta)/\partial\theta$ intersects the abscissa. Thus the amount of correction of the bias of the MLE is the distance between the WLE and the MLE, as illustrated in Figure 9 for the response pattern in which only one item is answered correctly. It is obvious that, in this example, the correction makes the estimates of $\theta$ regress toward $\theta = b_g = 0.0$. Note that for the response pattern that consists of 15 correct answers and 15 incorrect answers the correction is nil, and $\tilde{\theta}_v = \hat{\theta}_v$.

[IV.4] Straightforward Methods of Eliminating Bias

Lord (1983) suggested a direct correction of the bias of the MLE for the true test score, which is a monotone transformation of ability $\theta$. When applied to the original ability scale $\theta$, this corresponds to subtracting $B(\theta;\hat{\theta}_V)$, evaluated at $\theta = \hat{\theta}_v$, from $\hat{\theta}_v$. This correction tends to over-compensate for the bias. A more logical correction may be to identify the value of $\theta$ at which the discrepancy from $\hat{\theta}_v$ equals the value of the bias function at that point of $\theta$. Graphically, this can be done by drawing a line from $\hat{\theta}_v$ at an angle of 45 degrees to the abscissa until it reaches the curve of the MLE bias function, and then drawing a line perpendicular to the abscissa. Thus the corrected MLE differs from the original $\hat{\theta}_v$ by the expected amount of bias at that point of $\theta$.

The relationships among the MLE $\hat{\theta}_v$, the two corrected MLE's and Warm's WLE $\tilde{\theta}_v$ are also illustrated in Figure 9. It should be noted that the difference between the two corrected MLE's can be substantial where the MLE bias function is steep.
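Both corrections are easy to compute for the Figure 9 test. In the sketch below (ours; `lord_corrected` and `samejima_corrected` are hypothetical names), we assume 30 equivalent normal ogive items with $a_g = 0.7$ and $b_g = 0.0$. For such a test the MLE of a pattern with $r$ correct answers solves $\Phi(a_g(\theta - b_g)) = r/n$, and Eq. (14) takes the closed form $(\theta - b_g)P_gQ_g/[2n\phi(z)^2]$. The second correction finds $\theta$ with $\hat{\theta}_v - \theta = B(\theta)$, the 45-degree construction described above. For one correct answer, the sketch shows Lord's correction moving further toward 0 than the second correction. Neither correction exists for the all-correct or all-incorrect pattern.

```python
from statistics import NormalDist

STD = NormalDist()                 # standard normal
N_ITEMS, A_G, B_G = 30, 0.7, 0.0   # the hypothetical test of Figure 9


def bias(theta):
    """Eq. (14) for N_ITEMS equivalent normal ogive items; it simplifies to
    (theta - b) P Q / (2 n phi(z)^2) with z = a(theta - b)."""
    z = A_G * (theta - B_G)
    return (theta - B_G) * STD.cdf(z) * STD.cdf(-z) / (2.0 * N_ITEMS * STD.pdf(z) ** 2)


def mle(r):
    """MLE for r correct answers: solves Phi(a(theta - b)) = r / n."""
    if not 0 < r < N_ITEMS:
        raise ValueError("no finite MLE for the all-correct or all-incorrect pattern")
    return B_G + STD.inv_cdf(r / N_ITEMS) / A_G


def lord_corrected(theta_hat):
    """Subtract the bias evaluated at the MLE itself."""
    return theta_hat - bias(theta_hat)


def samejima_corrected(theta_hat, lo=-8.0, hi=8.0, tol=1e-12):
    """Find theta with theta_hat - theta = B(theta) by bisection.

    g(theta) = theta_hat - theta - B(theta) is decreasing here because B is increasing.
    """
    def g(t):
        return theta_hat - t - bias(t)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


t = mle(1)
print(t, samejima_corrected(t), lord_corrected(t))
```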

[IV.5] Usefulness of Warm's Weight Function as a Prior

These straightforward corrections of $\hat{\theta}_v$ may make Warm's WLE seem unnecessary. Note, however, that the two corrected MLE's cannot be obtained for either the all-correct or the all-incorrect response pattern, while Eq. (21) provides WLE's for these extreme response patterns as well.

In Bayesian estimation of ability $\theta$, it is customary for researchers in psychology and educational psychology to use, as the prior, the density function representing the ability distribution of some population to which the examinee belongs. Some researchers even believe that, because such Bayesian estimation of ability uses additional information (i.e., the prior), the resulting ability estimate should be more accurate than the MLE.




This idea contains several serious problems, however. First of all, as Samejima (1969) and Lord (1986) pointed out, the use of such a prior increases the amount of bias of the ability estimate. Secondly, since such a prior is based on rather trivial factors, such as gender, age, etc., it could lead to serious social and ethical problems. As Lord (1986) stated, the examinee's estimated ability depends not only on his/her test performance but also on the nature of the entire group in which he/she happens to be included. If the group as a whole is a low-ability group, the examinee's ability estimate may regress downward; if it is a high-ability group, his/her estimated ability may regress upward. Thus the same examinee may lose a job opportunity if the priors represent gender differences, for example, and yet earn the job if they represent socio-economic status.

To avoid this, it may seem advisable to customize a prior for each individual examinee by taking the intersection of many different attributes, until no one but the examinee belongs to the group on which the prior is based. If such a prior could be identified, however, there would be no need for testing.

Strengths of the prior used for the WLE are that:

1. the prior is intrinsic to the test and, therefore, nothing besides the examinee's test performance is used in ability estimation, and no unfair discrimination against any individual examinee will arise, and

2. its use will eliminate bias in ability estimation rather than increase it, so it can be used effectively in CAT as well as in paper-and-pencil testing.



V. MODIFIED TEST INFORMATION FUNCTION 

In CAT, it has been widely used practice to adopt a set amount of test information in the stopping rule. That is to say, testing ends when the amount of test information of the individually customized subset of items selected from the itempool reaches the criterion amount. The information is evaluated at the current estimate of the examinee's individual parameter, and once the criterion is reached, no more items are presented. This is legitimate when the individual parameter $\theta_a$ of the examinee lies within the interval of $\theta$ where the MLE bias function is practically nil, but it requires some modification when $\theta_a$ lies outside this interval.

Samejima proposed two modification formulae for the test information function (see Samejima, 1994b). These modifications are given by

$$\tilde{I}(\theta) = I(\theta)\left[1 + \frac{\partial}{\partial\theta}B(\theta;\hat{\theta}_V)\right]^{-2}$$

and

$$\mathcal{I}(\theta) = I(\theta)\left\{\left[1 + \frac{\partial}{\partial\theta}B(\theta;\hat{\theta}_V)\right]^2 + I(\theta)\,[B(\theta;\hat{\theta}_V)]^2\right\}^{-1}, \qquad (22)$$

respectively, where $I(\theta)$ is the test information function defined by Eq. (8) and $B(\theta;\hat{\theta}_V)$ is the MLE bias function specified in Eq. (20). The reciprocal of the second modified test information function, $\mathcal{I}(\theta)$, represents an approximate lower bound of the mean squared error of the MLE. The amount of correction is greater for values of $\theta$ at which the bias of the MLE is more pronounced. Figure 10 shows the square root of $\mathcal{I}(\theta)$, defined by Eq. (22), as a dotted line. It is compared with the square roots of the original test information function $I(\theta)$ and of $\tilde{I}(\theta)$, drawn by solid and dashed lines, respectively, for the 43 multiple-choice test items of the Iowa Level 11 Vocabulary Subtest, following the logistic model. It is desirable to use $\mathcal{I}(\theta)$ instead of $I(\theta)$ in the stopping rule of CAT when the MLE is used as the estimate of the examinee's individual parameter.
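Both modifications require only $I(\theta)$, $B(\theta;\hat{\theta}_V)$ and its slope. The sketch below is ours and does not use the Iowa subtest, whose item parameters are not given here. It assumes a hypothetical test of 30 equivalent normal ogive items ($a_g = 1.0$, $b_g = 0.0$) and approximates $\partial B/\partial\theta$ by a central difference. `stop_testing` is a hypothetical stopping rule of the kind described above, with an arbitrary target.

```python
from statistics import NormalDist

STD = NormalDist()
N_ITEMS, A_G, B_G = 30, 1.0, 0.0   # hypothetical test of equivalent normal ogive items


def info(theta):
    """Eq. (8) for N_ITEMS equivalent normal ogive items."""
    z = A_G * (theta - B_G)
    return N_ITEMS * (A_G * STD.pdf(z)) ** 2 / (STD.cdf(z) * STD.cdf(-z))


def bias(theta):
    """Eq. (14) for the same items, in its simplified closed form."""
    z = A_G * (theta - B_G)
    return (theta - B_G) * STD.cdf(z) * STD.cdf(-z) / (2.0 * N_ITEMS * STD.pdf(z) ** 2)


def bias_slope(theta, h=1e-5):
    """Central-difference approximation to dB/dtheta."""
    return (bias(theta + h) - bias(theta - h)) / (2.0 * h)


def modified_info_variance(theta):
    """First modification: I(theta) [1 + dB/dtheta]^(-2)."""
    return info(theta) / (1.0 + bias_slope(theta)) ** 2


def modified_info_mse(theta):
    """Eq. (22): I(theta) {[1 + dB/dtheta]^2 + I(theta) B^2}^(-1)."""
    return info(theta) / ((1.0 + bias_slope(theta)) ** 2 + info(theta) * bias(theta) ** 2)


def stop_testing(theta_hat, target):
    """Hypothetical stopping rule: stop once the modified information at the
    current estimate reaches the target amount."""
    return modified_info_mse(theta_hat) >= target


for t in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0):
    print(t, round(info(t), 2), round(modified_info_variance(t), 2), round(modified_info_mse(t), 2))
```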



Insert Figure 10 About Here 



VI. DISCUSSION 



In this paper, in an effort to improve methods of applying IRT in practical situations, especially in CAT, the content validity of the ability dimension was emphasized, and the idea of core test items was proposed. Devices were proposed to eliminate noise from multiple-choice test items by making effective use of the nonparametric estimation of operating characteristics in pilot studies, and the use of the normal ogive model instead of the 3PL was suggested. It was recommended to use several graded response items, such as the one presented in [I.3], at the beginning of CAT in order to avoid the bias and lack of information intrinsic to dichotomous response items. Warm's WLE and its expanded form for general discrete responses were discussed as an effective method of eliminating bias in ability estimation, and the usefulness of Warm's weight function as a prior was discussed. Use of the modified test information function was also suggested for the same purpose.

The author hopes that the methods suggested in the present paper will be tested by researchers in education in actual computerized adaptive testing, and that the results will be compared to find out how well each device, or each combination of devices, works.
Questions 13-19

Eight benches, J, K, L, T, U, X, Y, and Z, are arranged along the perimeter of a park as shown below:

[Diagram: map of the park, with compass labels North, East, South, and West]

The following is true:

J, K, and L are green; T and U are red; X, Y, and Z are pink.
The green benches stand next to one another along the park's perimeter.
The pink benches stand next to one another along the park's perimeter.
No green bench stands next to a pink bench.
The bench on the southeast corner is T.
J stands at the center of the park's north side.
If T stands next to X, then T does not also stand next to L.

13. Which one of the following benches could be on the northeast corner of the park?

(A) Z
(B) Y
(C) X
(D) T
(E) L

14. Each of the following statements must be true EXCEPT:

(A) The bench on the northwest corner is pink.
(B) The bench on the northeast corner is green.
(C) The bench on the southwest corner is pink.
(D) The middle bench on the east side of the park is green.
(E) The middle bench on the west side of the park is pink.

15. Which one of the following benches must be next to J?

(A) K
(B) L
(C) T
(D) U
(E) X

16. For which one of the following benches are there two and no more than two locations either one of which could be the location the bench occupies?

(A) K
(B) T
(C) X
(D) Y
(E) Z

17. If Z is directly north of Y, which one of the following statements must be true?

(A) J is directly west of K.
(B) K is directly east of U.
(C) U is directly north of X.
(D) X is directly south of J.
(E) Z is directly south of J.

18. If Y is in the middle of the west side of the park, then the two benches in which one of the following pairs CANNOT be two of the corner benches?

(A) K and X
(B) K and Z
(C) L and U
(D) L and X
(E) L and Z

19. If Y is farther south than L and farther north than T, then the benches in each of the following pairs must be next to each other EXCEPT

(A) J and L
(B) K and T
(C) T and X
(D) U and Y
(E) X and Z

FIGURE 1

Taken from LSAT, the Official Prep Test III, 1991, Vol. 2, Page 77: An Example Question.



(a) J, K, and L are green; T and U are red; X, Y, and Z are pink.

(b) The green benches stand next to one another along the park's perimeter.

(c) The pink benches stand next to one another along the park's perimeter.

(d) No green bench stands next to a pink bench.

(e) The bench on the southeast corner is T.

(f) J stands at the center of the park's north side.

(g) If T stands next to X, then T does not also stand next to L.
FIGURE 2

Modified LSAT Example to a Graded Response Item.
Normal Ogive Model; $a_g = 1.0$, $b_{x_g} = -1.50, -0.50, 0.00, 0.75, 1.25$
(Taken from ONR/RR-79-4: page 18)

Model A; $a_g = 1.0$, $b_{x_g} = -1.50, -0.50, 0.00, 0.75, 1.25$
(Taken from ONR/RR-79-4: page 20)
FIGURE 6

Examples of Informative Distractors and Equivalent Distractors
(Taken from ONR/RR-84-1: pages 43 and 55.)
FIGURE 7

Square Root of the Test Information Function of Each of the Five Hypothetical Tests of 30 Equivalent Items: the Normal Ogive Model.
FIGURE 8

MLE Bias Function of Each of the Five Hypothetical Tests of 30 Equivalent Items: the Normal Ogive Model.
FIGURE 9

Relationships among the MLE $\hat{\theta}_v$, the Two Corrected MLE's by Lord and Samejima, Respectively, and Warm's WLE $\tilde{\theta}_v$.
FIGURE 10

Square Roots of the Test Information Function (Solid) and Its Two Modifications (Dashed & Dotted) of the Iowa Level 11 Vocabulary Subtest: the Logistic Model.